3.5.5 Forward Propagation Based on Projection Convolution Layer
For each full-precision kernel $C_i^l$, the corresponding quantized kernels $\hat{C}_{i,j}^l$ are concatenated to construct the kernel $D_i^l$ that actually participates in the convolution operation as
$$D_i^l = \hat{C}_{i,1}^l \oplus \hat{C}_{i,2}^l \oplus \cdots \oplus \hat{C}_{i,J}^l, \tag{3.45}$$
where $\oplus$ denotes the concatenation operation on the tensors. In PCNNs, the projection convolution is implemented based on $D^l$ and $F^l$ to calculate the feature map $F^{l+1}$ of the next layer:
$$F^{l+1} = \mathrm{Conv2D}(F^l, D^l), \tag{3.46}$$
where $\mathrm{Conv2D}$ is the traditional 2D convolution. Although our convolutional kernels are 3D-shaped tensors, we design the following strategy to fit the traditional 2D convolution:
$$F_{h,j}^{l+1} = \sum_{i,h} F_h^l \otimes D_{i,j}^l, \tag{3.47}$$
$$F_h^{l+1} = F_{h,1}^{l+1} \oplus \cdots \oplus F_{h,J}^{l+1}, \tag{3.48}$$
where $\otimes$ denotes the convolution operation. $F_{h,j}^{l+1}$ is the $j$th channel of the $h$th feature map at the $(l+1)$th convolutional layer, and $F_h^l$ denotes the $h$th feature map at the $l$th convolutional layer. To be more precise, when $h = 1$, for example, the $j$th channel of an output feature map, $F_{1,j}^{l+1}$, is the sum of the convolutions between all $h$ input feature maps and the $i$ corresponding quantized kernels. All channels of the output feature map, $F_{h,1}^{l+1}, \ldots, F_{h,j}^{l+1}, \ldots, F_{h,J}^{l+1}$, are obtained in this way and concatenated to construct the $h$th output feature map $F_h^{l+1}$.
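To make the forward pass of Eqs. 3.45-3.48 concrete, below is a minimal PyTorch sketch; the framework choice, the function name projection_conv2d, and the tensor layout are assumptions for illustration, not the authors' released implementation. It takes quantized kernels $\hat{C}_{i,j}^l$, assumed to be already produced by the projections of Eq. 3.44, concatenates them into $D^l$, and applies a single standard 2D convolution.

```python
import torch
import torch.nn.functional as F

def projection_conv2d(feature_map, quantized_kernels, stride=1, padding=1):
    """Forward pass of the projection convolution layer (Eqs. 3.45-3.48), sketched.

    feature_map:       F^l, shape (batch, C_in, H, W)
    quantized_kernels: hat{C}^l_{i,j}, shape (I, J, C_in, k, k), assumed to be
                       already quantized by the projections of Eq. 3.44
    """
    I, J, C_in, k, _ = quantized_kernels.shape

    # Eq. 3.45: D^l_i = hat{C}^l_{i,1} (+) ... (+) hat{C}^l_{i,J}.
    # Stacking the J quantized copies along the output-channel axis lets a
    # single traditional 2D convolution produce all output channels at once.
    D = quantized_kernels.reshape(I * J, C_in, k, k)

    # Eq. 3.46: F^{l+1} = Conv2D(F^l, D^l).
    out = F.conv2d(feature_map, D, stride=stride, padding=padding)

    # Eqs. 3.47-3.48: view the result as I output feature maps, each made of
    # its J concatenated channels F^{l+1}_{h,1}, ..., F^{l+1}_{h,J}.
    batch, _, H, W = out.shape
    return out.reshape(batch, I, J, H, W)

# Toy usage with binary {-1, +1} stand-in kernels and J = 4 projections.
x = torch.randn(2, 64, 32, 32)                    # F^l
hatC = torch.sign(torch.randn(64, 4, 64, 3, 3))   # hat{C}^l_{i,j}
y = projection_conv2d(x, hatC)                    # shape (2, 64, 4, 32, 32)
```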
It should be emphasized that we can utilize multiple projections to increase the diversity of the convolutional kernels $D^l$. However, even a single projection performs much better than existing BNNs. What is essential is the use of DBPP, which differs from [147], where a single quantization scheme is used. Within our convolutional scheme, there is no dimension mismatch between the feature maps and kernels of two successive layers. Thus, we can replace the traditional convolutional layers with ours to binarize widely used networks, such as VGGs and ResNets. At inference time, we only store the set of quantized kernels $D_i^l$ instead of the full-precision ones; that is, the projection matrices $W_j^l$ are not used for inference, which reduces storage.
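As a sketch of how such a layer could stand in for a standard convolution at inference time, the module below stores only the concatenated quantized kernels $D^l$ as a buffer, so neither the full-precision kernels $C_i^l$ nor the projection matrices $W_j^l$ need to be kept. PyTorch, the class name, and the constructor interface are assumptions made for this illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class ProjectionConv2dInference(nn.Module):
    """Inference-time projection convolution: only the quantized D^l is stored."""

    def __init__(self, quantized_kernels, stride=1, padding=1):
        super().__init__()
        I, J, C_in, k, _ = quantized_kernels.shape
        # A buffer is serialized with the model but never touched by the
        # optimizer, matching the claim that W^l_j is not needed at inference.
        self.register_buffer("D", quantized_kernels.reshape(I * J, C_in, k, k))
        self.stride, self.padding = stride, padding

    def forward(self, x):
        # Eq. 3.46, computed from the stored quantized kernels alone.
        return F.conv2d(x, self.D, stride=self.stride, padding=self.padding)
```

Such a module could then replace the corresponding convolutional layers of a VGG or ResNet, since, as noted above, the feature-map and kernel dimensions of successive layers remain consistent.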
3.5.6 Backward Propagation
According to Eq. 3.44, what should be learned and updated are the full-precision kernels $C_i^l$ and the projection matrix $W^l$ ($\hat{W}^l$), using the update equations described below.
Updating $C_i^l$: We define $\delta_{C_i}$ as the gradient of the full-precision kernel $C_i$, and have
$$\delta_{C_i^l} = \frac{\partial L}{\partial C_i^l} = \frac{\partial L_S}{\partial C_i^l} + \frac{\partial L_P}{\partial C_i^l}, \tag{3.49}$$
$$C_i^l \leftarrow C_i^l - \eta_1 \delta_{C_i^l}, \tag{3.50}$$
where $\eta_1$ is the learning rate for the convolutional kernels. More specifically, for each term in Eq. 3.49, we have
$$\frac{\partial L_S}{\partial C_i^l} = \sum_{j}^{J} \frac{\partial L_S}{\partial \hat{C}_{i,j}^l} \frac{\partial P_{\Omega_N}^{l,j}(\hat{W}_j^l, C_i^l)}{\partial (\hat{W}_j^l \circ C_i^l)} \frac{\partial (\hat{W}_j^l \circ C_i^l)}{\partial C_i^l} = \sum_{j}^{J} \frac{\partial L_S}{\partial \hat{C}_{i,j}^l} \circ \mathbf{1}_{-1 \le \hat{W}_j^l \circ C_i^l \le 1} \circ \hat{W}_j^l, \tag{3.51}$$
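As an illustration of Eq. 3.51, the following sketch computes $\partial L_S / \partial C_i^l$ from the incoming gradients with respect to the quantized kernels; the tensor shapes, the broadcasting of $\hat{W}_j^l$, and the function name are assumptions made only for this example.

```python
import torch

def grad_LS_wrt_C(grad_hatC, W_hat, C):
    """dL_S / dC^l_i as in Eq. 3.51, under assumed tensor layouts.

    grad_hatC: dL_S / d hat{C}^l_{i,j}, shape (I, J, C_in, k, k)
    W_hat:     projection matrices hat{W}^l_j, shape (J, C_in, k, k)
    C:         full-precision kernels C^l_i, shape (I, C_in, k, k)
    Returns:   dL_S / dC^l_i, shape (I, C_in, k, k)
    """
    W = W_hat.unsqueeze(0)   # (1, J, C_in, k, k), broadcast over kernels i
    Ci = C.unsqueeze(1)      # (I, 1, C_in, k, k), broadcast over projections j

    # Indicator 1_{-1 <= hat{W}^l_j o C^l_i <= 1}: the gradient passes only
    # where the projected value lies inside the quantization range.
    inside = ((W * Ci).abs() <= 1).to(grad_hatC.dtype)

    # Eq. 3.51: element-wise products, summed over the J projections.
    return (grad_hatC * inside * W).sum(dim=1)
```

The result would then be added to $\partial L_P / \partial C_i^l$ and used in the update of Eq. 3.50.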